Let’s do some polling! Go to pollev.com/kumarr436
What do the students already know? You may be able to answer this from program structure, or through a survey.
Sometimes students will be coming in with different levels of background knowledge. You can try to address this before the session and/or adjust your teaching accordingly.
Where can students get background knowledge before the session?
learnr tutorials
RMarkdown (and RProjects)
Motivation
Help students feel comfortable
I usually outline the core content by:
I strongly suggest an examples and exercises approach to teaching skills in R. We’ll practice this in a moment.
I always like to end with:
Open the RProject file and look in the working directory: you will see an exercises subdirectory and an answers subdirectory.
The following lesson snippets all use .R code files for the exercises. You can also ask students to use .Rmd, especially if this is part of a course where you will need to collect assignment submissions.
ggplot() function
geom functions, such as geom_point() or geom_histogram()
aes() function nested within ggplot() or a geom function
+ operator
# Load the tidyverse, which provides ggplot2 and glimpse()
library(tidyverse)
# Load data
gapminder <- gapminder::gapminder
# Look at the structure of the data. You can use glimpse(), summary(), or head().
glimpse(gapminder)
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
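Several plots in this lesson use a data frame called gapminder07, which is not defined in this excerpt. A minimal sketch, assuming it is simply the 2007 rows of gapminder (the use of dplyr's filter() here is an assumption, not necessarily how the original lesson built it):

```r
library(dplyr)  # provides filter()

gapminder <- gapminder::gapminder

# Assumption: gapminder07 is just the 2007 slice of the full data
gapminder07 <- filter(gapminder, year == 2007)

# One row per country for that year
nrow(gapminder07)
```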
These will produce the same output:
ggplot(gapminder) +
  geom_point(aes(x=year, y=pop)) +
  labs(title="Population over time", x="Year", y="Population")
Plot life expectancy as a function of GDP per capita for the year 2007, and add labels.
Supply gapminder07 to ggplot()
Add geom_point() with the + operator
Supply x=gdpPercap and y=lifeExp to aes()
Set title, x, and y in labs()
ggplot(gapminder07) +
  geom_point(aes(x=gdpPercap, y=lifeExp)) +
  labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
There are many geom functions we can choose to generate geometric objects:
Let’s try to add geom_smooth() to the previous plot we created.
ggplot(gapminder07, aes(x=gdpPercap, y=lifeExp)) +
  geom_point() +
  geom_smooth() +
  labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(gapminder07) +
  geom_point(aes(x=gdpPercap, y=lifeExp)) +
  geom_hline(aes(yintercept=mean(lifeExp))) +
  labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy", subtitle="Gapminder 2007 data")
Plot the life expectancy of each continent in 2007.
Look at the ggplot cheatsheet and decide which kind of geom to use.
Supply gapminder07 to ggplot()
Choose a geom (e.g. geom_boxplot()) and supply appropriate aesthetics in nested aes() (e.g. x=continent, y=lifeExp)
Add labels with labs()
ggplot(gapminder07) +
  geom_boxplot(aes(x=continent, y=lifeExp)) +
  labs(title="Distribution of life expectancy", x="Continent", y="Life expectancy")
# geom_col() stacks one bar segment per country, so bar heights are sums, not averages
ggplot(gapminder07) +
  geom_col(aes(x=continent, y=lifeExp)) +
  labs(title="Distribution of life expectancy", x="Continent", y="Life expectancy")
You can think of the continent or year variables as grouping variables: they place each observation in one of several groups.
We can represent the groups through aesthetic mapping or facets rather than along one of the axes.
ggplot(gapminder) +
  geom_point(aes(x=gdpPercap, y=lifeExp, col=year)) +
  labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~year) +
  labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try adding the argument scales="free" to the facet_wrap() layer.
ggplot(gapminder, aes(x=gdpPercap, y=lifeExp)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~year, scales="free") +
  labs(title="Do people in richer countries live longer?", x="GDP per capita", y="Life expectancy")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Visualize life expectancy by continent in 2007 again. This time, group continents by color or facet.
ggplot(gapminder07, aes(x=gdpPercap, y=lifeExp,
                        col=continent, size=pop)) +
  geom_point() +
  labs(title="Life expectancy as a function of GDP per capita, by continent", x="GDP per capita", y="Life expectancy")
# The density plot has life expectancy on the x-axis, so the labels describe that
ggplot(gapminder07, aes(x=lifeExp)) +
  geom_density() +
  facet_wrap(~continent, scales="free") +
  labs(title="Distribution of life expectancy, by continent", x="Life expectancy", y="Density")
Choose your own adventure!
Create a plot that includes two geoms and facets
# Load tidyverse and lubridate
library(tidyverse)
library(lubridate)
# Import vaccine data
vaccines <- read_csv("data/chicago_vaccines_daily.csv")
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## date = col_character(),
## doses = col_double(),
## first_dose = col_double(),
## final_dose = col_double()
## )
## Rows: 154
## Columns: 4
## $ date <chr> "12/15/2020", "12/16/2020", "12/17/2020", "05/16/2021", "05…
## $ doses <dbl> 16, 157, 1990, 4783, 4697, 5729, 3438, 936, 4385, 3, 3655, …
## $ first_dose <dbl> 16, 157, 1990, 1719, 1672, 5729, 3438, 936, 4385, 3, 3655, …
## $ final_dose <dbl> 0, 0, 0, 3140, 3120, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3074, 24…
In your exercise file, check the class of the date variable.
Check the class of the date variable:
## [1] "character"
This is just a character string. And it’s not even ordered correctly!
## [1] "12/15/2020" "12/16/2020" "12/17/2020" "05/16/2021" "05/17/2021"
## [6] "12/18/2020"
It looks like as_date() might be a helpful function from lubridate. But what happens when we use it?
## Warning: All formats failed to parse. No formats found.
## [1] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [26] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [51] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [76] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [101] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [126] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## [151] NA NA NA NA
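The steps so far can be sketched without the CSV file. This is only an illustration: it uses the six date strings printed earlier as a stand-in for vaccines$date, and the name date_chr is made up for the example.

```r
library(lubridate)

# First six date strings as printed earlier (stand-in for vaccines$date)
date_chr <- c("12/15/2020", "12/16/2020", "12/17/2020",
              "05/16/2021", "05/17/2021", "12/18/2020")

class(date_chr)   # a plain character vector
sort(date_chr)    # sorts as text, so the May 2021 strings come before December 2020

# as_date() expects year-month-day text by default, so month/day/year strings
# fail to parse; the warning is suppressed here to keep the output short
parsed <- suppressWarnings(as_date(date_chr))
parsed
```

Because the strings sort as text rather than as dates, any plot or summary that relies on their order will be wrong until they are converted.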
Whoops! Let’s turn to the lubridate cheatsheet for help. What function should we use?
We can use the tailored mdy() function:
## [1] "2020-12-15" "2020-12-16" "2020-12-17" "2021-05-16" "2021-05-17"
## [6] "2020-12-18"
Or we can specify the format of the values in the character string using the format= argument in as_date(). See the help file for strptime() for how to define formats.
## [1] "2020-12-15" "2020-12-16" "2020-12-17" "2021-05-16" "2021-05-17"
## [6] "2020-12-18"
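A sketch of the two conversion routes side by side, again using the six printed date strings as a stand-in for vaccines$date (date_chr is a made-up name). The format string "%m/%d/%Y" is an assumption about what the data needs: uppercase %Y reads a four-digit year, whereas lowercase %y reads only two digits and would turn the 2021 dates into 2020.

```r
library(lubridate)

date_chr <- c("12/15/2020", "12/16/2020", "12/17/2020",
              "05/16/2021", "05/17/2021", "12/18/2020")

# Option 1: mdy() assumes month-day-year order
via_mdy <- mdy(date_chr)

# Option 2: spell out the format; %Y is the four-digit year
via_format <- as_date(date_chr, format = "%m/%d/%Y")

# Both routes should agree element by element
all(via_mdy == via_format)
```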
# Replace the date variable in the dataset with the converted version
vaccines$date <- mdy(vaccines$date)
# Check the class of our converted variable
class(vaccines$date)
## [1] "Date"
Now, let’s try to convert the date variable, which is in Date class, into a numeric value for month.
# Create a new variable for month. Refer to the cheatsheet for guidance.
vaccines$month <- month(vaccines$date)
# Repeat the plot, using month as the x-axis
ggplot(vaccines, aes(x=month, y=doses)) + geom_col()
What if we want to get the day of the week? Identify the right function using the lubridate cheatsheet. Check the help file to see if there are any useful arguments.
We can use wday(). Note the label=TRUE argument.
# Create a new variable for day of the week, with label=TRUE
vaccines$wday <- wday(vaccines$date, label=TRUE)
# Repeat the plot, using day of the week as the x-axis
ggplot(vaccines, aes(x=wday, y=doses)) + geom_col()
Bonus exercise if time permits: Can you calculate the number of days it took for Chicago to fully vaccinate 1 million people?
Hint: You may need to use the function cumsum()
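One possible approach, sketched on a tiny made-up data frame rather than the Chicago file. The numbers, the toy name, and the target of 600 are all hypothetical, and treating final_dose as "fully vaccinated" is an assumption.

```r
library(lubridate)

# Made-up toy data, deliberately out of order like the real file
toy <- data.frame(
  date = mdy(c("01/03/2021", "01/01/2021", "01/02/2021", "01/04/2021")),
  final_dose = c(300, 100, 250, 500)
)

# Sort by date first, because cumsum() follows row order
toy <- toy[order(toy$date), ]
toy$cum_final <- cumsum(toy$final_dose)

target <- 600  # hypothetical stand-in for 1 million

# First date on which the running total reaches the target
reached <- toy$date[which(toy$cum_final >= target)[1]]

# Days elapsed between the first recorded date and that date
days_taken <- as.numeric(difftime(reached, min(toy$date), units = "days"))
days_taken
```

The sorting step matters for the real data too, since the rows are not in date order.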
Let’s use the gapminder data that we are already familiar with to practice implementing some linear regressions and examining the results.
## Rows: 1,704
## Columns: 6
## $ country <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", …
## $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, …
## $ year <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, …
## $ lifeExp <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8…
## $ pop <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12…
## $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, …
Are life expectancy and GDP per capita related? A scatterplot suggests … maybe! We can investigate the relationship in a different way with a regression.
The lm() function in base R takes a formula and an argument specifying the data frame:
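Let’s implement a regression with the dependent variable (DV) lifeExp and one independent variable (IV), gdpPercap. The output below uses gdpPercap_perthou, which appears to be gdpPercap rescaled to thousands of dollars. We’ll save the results as mod1. Look at what the object class is in your environment.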
We can print out the results using summary() on the saved list object.
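A minimal sketch of these steps. The definition of gdpPercap_perthou is not shown in this excerpt, so dividing gdpPercap by 1000 is an assumption made here to match the variable name in the printed output.

```r
gapminder <- gapminder::gapminder

# Assumption: gdpPercap_perthou is GDP per capita in thousands of dollars
gapminder$gdpPercap_perthou <- gapminder$gdpPercap / 1000

# General pattern: lm(outcome ~ predictor, data = data_frame)
mod1 <- lm(lifeExp ~ gdpPercap_perthou, data = gapminder)

class(mod1)    # "lm"
summary(mod1)  # coefficient table, residuals, and fit statistics
```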
##
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -82.754 -7.758 2.176 8.225 18.426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 53.95556 0.31499 171.29 <2e-16 ***
## gdpPercap_perthou 0.76488 0.02579 29.66 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.49 on 1702 degrees of freedom
## Multiple R-squared: 0.3407, Adjusted R-squared: 0.3403
## F-statistic: 879.6 on 1 and 1702 DF, p-value: < 2.2e-16
Implement a regression with the DV lifeExp and the IVs gdpPercap and year.
Then examine the results using summary().
##
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou + year, data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -67.262 -6.954 1.219 7.759 19.553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -418.42426 27.61714 -15.15 <2e-16 ***
## gdpPercap_perthou 0.66973 0.02447 27.37 <2e-16 ***
## year 0.23898 0.01397 17.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.694 on 1701 degrees of freedom
## Multiple R-squared: 0.4375, Adjusted R-squared: 0.4368
## F-statistic: 661.4 on 2 and 1701 DF, p-value: < 2.2e-16
Alternatively, we can visualize the coefficients and uncertainty using coefplot(). But the default plot isn’t very easy to read. What can we do to improve it? Check out the help file.
We can remove the intercept using the argument intercept=FALSE.
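A sketch of that call, assuming the model being plotted is the two-predictor regression from the previous exercise (called mod2 here, a name this excerpt does not confirm) and that gdpPercap_perthou is gdpPercap divided by 1000:

```r
library(coefplot)

gapminder <- gapminder::gapminder
gapminder$gdpPercap_perthou <- gapminder$gdpPercap / 1000  # assumed rescaling

mod2 <- lm(lifeExp ~ gdpPercap_perthou + year, data = gapminder)

# The intercept is far larger in magnitude than the slopes, so dropping it
# keeps the axis from being stretched and makes the slopes readable
p <- coefplot(mod2, intercept = FALSE)
p
```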
Predict life expectancy as a function of GDP per capita, year, and continent.
##
## Call:
## lm(formula = lifeExp ~ gdpPercap_perthou + year + continent,
## data = gapminder)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.4264 -4.0725 0.2154 4.4853 19.9977
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -520.67458 19.79081 -26.31 <2e-16 ***
## gdpPercap_perthou 0.29675 0.01996 14.87 <2e-16 ***
## year 0.28739 0.01000 28.73 <2e-16 ***
## continentAmericas 14.32676 0.49358 29.03 <2e-16 ***
## continentAsia 9.50561 0.45670 20.81 <2e-16 ***
## continentEurope 19.39554 0.51730 37.49 <2e-16 ***
## continentOceania 20.58592 1.46895 14.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.884 on 1697 degrees of freedom
## Multiple R-squared: 0.717, Adjusted R-squared: 0.716
## F-statistic: 716.6 on 6 and 1697 DF, p-value: < 2.2e-16
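Note that continent is a factor variable, so lm() enters it as a set of indicator (dummy) variables, one per level except the reference level. The excluded category is Africa, the first factor level, so each continent coefficient is a difference from Africa.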
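To see which continent serves as the baseline, one can inspect the factor levels and the coefficient names. This is a sketch; as before, gdpPercap_perthou is assumed to be gdpPercap divided by 1000, and mod3 is a name chosen for the example.

```r
gapminder <- gapminder::gapminder
gapminder$gdpPercap_perthou <- gapminder$gdpPercap / 1000  # assumed rescaling

# The first factor level is the reference category that lm() leaves out
levels(gapminder$continent)

mod3 <- lm(lifeExp ~ gdpPercap_perthou + year + continent, data = gapminder)

# Every continent except the reference level gets its own coefficient
names(coef(mod3))
```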
RMarkdown, RProjects, and RStudio itself.
Examples from NU:
Other resources:
learnr interactive tutorials, from RStudio
Take it to the next level (suggestions from Christina Maimone):